Measure startup, each bench run and shutdown separately #13162


Open

wants to merge 13 commits into master from improve_bench

Conversation

mvorisek
Contributor

@mvorisek mvorisek commented Jan 16, 2024

Individual times per each run should be collected to plot data with robust statistics (like confidence intervals).

@mvorisek mvorisek force-pushed the improve_bench branch 2 times, most recently from 9411ff0 to 082bacd Compare January 16, 2024 17:40
@mvorisek mvorisek marked this pull request as ready for review January 16, 2024 18:19
$profileOutSpecific = $profileOut . '.' . $k;

runCommand([
'callgrind_annotate',
Contributor Author
Any idea how to convince callgrind_annotate to resolve all symbols, especially all the ones defined in the SQLite lib? libsqlite3-dev is installed.

Member
You would need to compile with CFLAGS="-g -O2" to get debug symbols. But with multiple repetitions we don't generally care about startup/shutdown. It's not very meaningful for applications run with opcache. If you want to provide startup/shutdown metrics, you can do so without repetitions.
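As a sketch, assuming an in-tree php-src build (the configure flags here are illustrative, not the exact CI setup):

```shell
# Rebuild PHP with debug info; -g adds symbols while -O2 keeps the
# optimization level comparable to a release build.
CFLAGS="-g -O2" ./configure --enable-opcache
make -j"$(nproc)"
```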

Contributor Author

@mvorisek mvorisek Jan 26, 2024

[image: annotated profile excerpt showing unresolved function names]

In the screenshot you can see some examples of unresolved functions (from the callgrind.symfony-demo.0 profile). Will CFLAGS="-g -O2" really help, given that other functions are resolved fine?

All unresolved functions are listed at the end of each annotated profile file (like callgrind.symfony-demo.0.txt).

Member
Not for SQLite, since we're not compiling its source files.

Member
Debug symbols for Debian packages are commonly available via the corresponding -dbgsym package.
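On Ubuntu, the -dbgsym packages come from a separate ddebs archive that has to be enabled first; a sketch (availability of libsqlite3-0-dbgsym for a given release is an assumption here):

```shell
# Enable Ubuntu's debug-symbol (ddebs) archive and pull SQLite's symbols.
apt-get install -y ubuntu-dbgsym-keyring lsb-release
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" \
    > /etc/apt/sources.list.d/ddebs.list
apt-get update
apt-get install -y libsqlite3-0-dbgsym
```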

Contributor Author

In an Ubuntu Docker container, after apt update, I get only these:

# apt-cache search libsqlite
libsqlite3-0 - SQLite 3 shared library
libsqlite3-dev - SQLite 3 development files
libsqlite3-mod-ceph - SQLite3 VFS for Ceph
libsqlite3-mod-ceph-dev - SQLite3 VFS for Ceph (development files)
libsqlite-tcl - SQLite 2 Tcl bindings
libsqlite0 - SQLite 2 shared library
libsqlite0-dev - SQLite 2 development files
libsqlite3-gst - SQLite bindings for GNU Smalltalk
libsqlite3-mod-blobtoxy - SQLite3 extension module for read-only BLOB to X/Y mapping
libsqlite3-mod-csvtable - SQLite3 extension module for read-only access to CSV files
libsqlite3-mod-impexp - SQLite3 extension module for SQL script, XML, JSON and CSV import/export
libsqlite3-mod-rasterlite2 - SQLite 3 module for huge raster coverages
libsqlite3-mod-spatialite - Geospatial extension for SQLite - loadable module
libsqlite3-mod-virtualpg - Loadable dynamic extension to both SQLite and SpatiaLite
libsqlite3-mod-xpath - SQLite3 extension module for querying XML data with XPath
libsqlite3-mod-zipfile - SQLite3 extension module for read-only access to ZIP files
libsqlite3-ocaml - Embeddable SQL Database for OCaml Programs (runtime)
libsqlite3-ocaml-dev - Embeddable SQL Database for OCaml Programs (development)
libsqlite3-tcl - SQLite 3 Tcl bindings
libsqliteodbc - ODBC driver for SQLite embedded database

Member

Contributor Author

Thank you!! That link brought me to https://valgrind.org/docs/manual/manual-core.html#manual-core.debuginfod. Committed in d4fe3ef.
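Per the linked manual section, recent Valgrind builds can fetch debug info on demand via debuginfod; a sketch of what the setup amounts to (the server URL is Ubuntu's public debuginfod instance, assumed here, and the php-cgi invocation is illustrative):

```shell
# Point Valgrind's debuginfod client at the distro's symbol server,
# so missing debug info is downloaded on demand during the run.
export DEBUGINFOD_URLS="https://debuginfod.ubuntu.com"
valgrind --tool=callgrind php-cgi -T 50 index.php
```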

Now I still see some missing sources for the core Linux libraries:


--------------------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
--------------------------------------------------------------------------------
  ./csu/../csu/libc-start.c
  ./csu/../sysdeps/nptl/libc_start_call_main.h
  ./csu/./csu/errno-loc.c
  ./ctype/../include/ctype.h
  ./debug/./debug/explicit_bzero_chk.c
  ./elf/../sysdeps/generic/dl-protected.h
  ./elf/../sysdeps/generic/ldsodefs.h
  ./elf/../sysdeps/x86_64/dl-machine.h
  ./elf/../sysdeps/x86_64/dl-runtime.h
  ./elf/../sysdeps/x86_64/dl-trampoline.h
  ./elf/./elf/dl-lookup.c
  ./elf/./elf/dl-runtime.c
  ./io/../sysdeps/posix/getcwd.c
  ./io/../sysdeps/unix/syscall-template.S
...
  ./io/../sysdeps/unix/sysv/linux/write.c
  ./libio/./libio/fileops.c
  ./libio/./libio/iofflush.c
  ./libio/./libio/libioP.h
  ./malloc/./malloc/arena.c
  ./malloc/./malloc/malloc.c
  ./malloc/./malloc/morecore.c
  ./misc/../sysdeps/unix/syscall-template.S
  ./misc/../sysdeps/unix/sysv/linux/brk.c
  ./misc/../sysdeps/unix/sysv/linux/x86_64/syscall.S
  ./misc/./misc/sbrk.c
  ./misc/./misc/syslog.c
...
  ./stdlib/../stdlib/strtol.c
  ./stdlib/../stdlib/strtol_l.c
  ./stdlib/./stdlib/getenv.c
  ./string/../sysdeps/x86_64/multiarch/memchr-avx2.S
...
  /build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/main.c
  /build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/malloc.c
  /build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/mem1.c
  /build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/memjournal.c
  /build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/mutex.c
  /build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/mutex_unix.c
  /build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/notify.c
  /build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/os.c
...

Any idea?

@iluuu1994
Member

Is there any evidence that picking the fastest run is the most accurate representation for our benchmark? It would seem that this heavily depends on the distribution / outliers.

@mvorisek
Contributor Author

mvorisek commented Jan 19, 2024

I will collect some data using GitHub Actions. In general, the lowest time is usually a good metric, as it is bound by the fastest reality and immune to random delays caused by nondeterministic events like context switches, allocations, resource delays, ...

So far I have been able to identify the reason why Symfony and Wordpress times are unstable.

@iluuu1994
Member

iluuu1994 commented Jan 19, 2024

I will collect some data using Github Actions.

That would be great. We should demonstrate how much more stable the benchmark actually becomes this way.

In general, the lowest time is usually a good metric, as it is bound by the fastest reality and immune to random delays caused by nondeterministic events like context switches, allocations, resource delays, ...

I suppose that should be ok. @nielsdos Any thoughts on this?

In case this is the approach we want to go with: I don't think the approach of dumping stats for every iteration is great, because it makes Valgrind unusable without this benchmarking script. It would be better if we could track the fastest run internally and dump it at the end, but unfortunately Valgrind's API is quite lacking.

SQLite is using malloc heavily and the implementation is not deterministic, as some allocations are internally reallocated based on the allocation alignment

We found the same in #12130 and iluuu1994#99, but the mitigations were not satisfactory. Depending on the layout of the allocated data, SSE can be used more effectively.

I do wonder whether it may be worth checking if the code could be optimized by allocating a bit more memory and aligning it manually to make use of SSE. This would solve the benchmark instability and potentially improve performance. If this happens primarily in SQLite, it may be out of our control.

@nielsdos
Member

I suppose that should be ok. @nielsdos Any thoughts on this?

Not an expert, I'd have to refresh my stats course, but anyway on first sight:

  • Noise can only make things slower, never faster. So to stabilize benchmarks it can indeed make sense to use the minimum, as is often done. But it can give you false confidence in the results if noise is common for the benchmarked program (as is often the case for I/O-heavy programs).
  • If the "underlying distribution" that the run time follows is not heavily skewed, then with enough iterations it should tend to a normal distribution, in which case reporting the average + standard deviation is useful.

Anyway, I think minimum could be fine, but it would be nice to know the variance and what the underlying distribution looks like.
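The trade-off can be illustrated with synthetic data (a sketch, not the actual benchmark numbers): a stable base cost plus occasional one-sided noise, which pulls the mean upward but barely moves the minimum or median.

```python
import random
import statistics

random.seed(42)

# Model: a fixed "true" cost plus occasional one-sided noise
# (context switches, allocation jitter) that can only slow a run down.
TRUE_COST = 1000.0
runs = [
    TRUE_COST + (random.expovariate(1 / 50.0) if random.random() < 0.1 else 0.0)
    for _ in range(500)
]

mean = statistics.fmean(runs)
median = statistics.median(runs)
minimum = min(runs)

# The minimum recovers the noise-free cost exactly; the median is close;
# the mean is dragged upward by the one-sided outliers.
assert minimum == TRUE_COST
assert abs(median - TRUE_COST) <= abs(mean - TRUE_COST)
print(f"mean={mean:.1f} median={median:.1f} min={minimum:.1f}")
```

Because the noise is strictly additive, the minimum is the most robust statistic here; it would stop being trustworthy only if noise could also make runs faster.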

@mvorisek
Contributor Author

mvorisek commented Jan 21, 2024

I ran 500 benchmarks of each demo.


[image: chart of Ir over the first runs, per benchmark]

In the figure above (showing the first few run times, excluding startup) you can see that tracing JIT needs about 10 runs for the required Ir (instruction count) to stabilize. Non-JIT is stable after about 2-3 runs.


[image: run time distributions with distant outliers]

In Symfony JIT and in Wordpress (both JIT and non-JIT), there are distant outliers.


[image: Symfony JIT distribution after outlier trimming]

Distribution of Symfony JIT after 12.5% of outliers were removed.


I did a few experiments with the trimmed mean, but the results were almost the same as the median, so let's use the median for reporting.

@mvorisek mvorisek marked this pull request as draft January 21, 2024 18:03
@mvorisek mvorisek force-pushed the improve_bench branch 2 times, most recently from 71088d9 to a978f8c Compare January 22, 2024 11:01
@mvorisek mvorisek marked this pull request as ready for review January 22, 2024 11:48
@mvorisek
Contributor Author

mvorisek commented Jan 22, 2024

Before this PR, when there were no warmup_repeats (for the Zend/bench.php benchmark), the startup (before the 1st run) was measured. This is inconsistent in terms of result definition, but I have kept this behavior for the chart results.

In a follow-up PR, I will improve the benchmarking further to collect the mean and standard deviation for repeated startup/shutdown as well, and to store all these data in the benchmarking repo.

For now, let me know if there are more questions about this PR. It pre-lands some changes and makes the single reported Ir value more stable:

  • 1.1% more stable for Symfony
  • 8.5% more stable for Symfony JIT
  • 18.8% more stable for Wordpress
  • 22% more stable for Wordpress JIT
  • for the Zend/bench.php benchmark the results are the same, as it is run only once

Member

@iluuu1994 iluuu1994 left a comment

Thanks for investigating! Can you clarify what you mean by "x% more stable"? How did you gather these numbers?

Since you say that the results are similar to the trimmed mean, and we're already using the mean (untrimmed, but primed) I wouldn't expect there to be a big difference.

/* measure shutdown separately */
CALLGRIND_DUMP_STATS;
#endif

Member

As mentioned, I don't think dumping stats to stdout/stderr repeatedly is a good solution, because it makes Valgrind useless for all other use-cases where this script is not used. A better solution would be if we could calculate the mean or whatever metric we find most useful, from C, and output that when the script terminates.

Valgrind allows specifying the output file using --log-file="filename". This would allow accessing the output from C. Unfortunately, there seems to be no way to set this programmatically, so it would have to be set manually when invoking PHP with Valgrind.

Contributor Author

Yes, --callgrind-out-file= cannot be accessed/detected as it is not exported.

What do you want to achieve? Does CALLGRIND_DUMP_STATS dump stats even if callgrind is not enabled? Also note, CALLGRIND_DUMP_STATS is only called if CGI benchmarking (-T) is enabled.

A better solution would be if we could calculate the mean or whatever metric we find most useful, from C, and output that when the script terminates.

This is not possible with Valgrind and also probably not desirable, as having multiple Callgrind output files is helpful for investigating time differences.

Member

@iluuu1994 iluuu1994 Jan 22, 2024

What I mean is that Dmitry and others use valgrind php-cgi -Tn ... for things other than Symfony/Wordpress/etc. However, in this case the script will now dump the stats in every iteration, requiring manually extracting the values and calculating some mean. So the output becomes essentially useless.

Contributor Author

In 9d77734 I have put the CALLGRIND_DUMP_STATS call behind an env var condition. If the env var is not present, nothing is dumped during the run.

Contributor Author

@iluuu1994 can I ask you to review this PR?

@@ -63,7 +63,8 @@ function runSymfonyDemo(bool $jit): array {
cloneRepo($dir, 'https://github.com/php/benchmarking-symfony-demo-2.2.3.git');
runPhpCommand([$dir . '/bin/console', 'cache:clear']);
runPhpCommand([$dir . '/bin/console', 'cache:warmup']);
return runValgrindPhpCgiCommand('symfony-demo', [$dir . '/public/index.php'], cwd: $dir, jit: $jit, warmup: 50, repeat: 50);

return runValgrindPhpCgiCommand('symfony-demo', [$dir . '/public/index.php'], cwd: $dir, jit: $jit, repeat: 50);
Member

Removing the warmup would negatively impact the statistical mean. You should get more accurate results if you keep the warmup.

Contributor Author

CGI warmup is the same as a normal CGI repeat, except for how time is reported. This is not relevant to the Ir/Valgrind benchmarking.

If you mean simply not measuring the first 50 runs and then measuring the following 50 runs, this is not needed, as I have verified the results and discussed them in #13162 (comment): non-JIT Ir is stable after 2 runs and JIT Ir is stable after about 10 runs.

@mvorisek mvorisek marked this pull request as draft January 22, 2024 13:41
@mvorisek
Contributor Author

mvorisek commented Jan 22, 2024

Thanks for investigating! Can you clarify what you mean by "x% more stable"? How did you gather these numbers?

Since you say that the results are similar to the trimmed mean, and we're already using the mean (untrimmed, but primed) I wouldn't expect there to be a big difference.

I compared the current mean (of 490 datapoints for each benchmark; the first 10 were always excluded) with the trimmed mean and divided the difference by the standard deviation. The differences are thus relative to the sd and explained by the outliers shown in the charts above.
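The metric can be sketched as follows (synthetic data standing in for the 490 collected datapoints; the 12.5% trim fraction mirrors the trimming discussed above):

```python
import random
import statistics

def trimmed_mean(xs, trim=0.125):
    """Mean after dropping the trim fraction of points from each tail."""
    xs = sorted(xs)
    k = int(len(xs) * trim)
    return statistics.fmean(xs[k:len(xs) - k])

random.seed(1)
# Synthetic Ir counts: stable base plus occasional large positive outliers.
data = [
    1_000_000
    + (random.expovariate(1 / 40_000) if random.random() < 0.05 else random.gauss(0, 500))
    for _ in range(490)
]

sd = statistics.stdev(data)
# "x% more stable": distance between the plain mean and the robust
# center, expressed relative to one standard deviation.
instability = abs(statistics.fmean(data) - trimmed_mean(data)) / sd
print(f"plain vs trimmed mean differ by {instability:.1%} of one sd")
```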

@iluuu1994
Member

The differences are thus relative to sd and explained by the outliers shown in the charts above.

Ok, thanks. Unfortunately, the 1.1% improvement for Symfony is not very significant. We don't have much trouble with the other benchmarks. I'll investigate to see if the following is possible:

I do wonder whether it may be worth checking if the code could be optimized by allocating a bit more memory and aligning it manually to make use of SSE. This would solve the benchmark instability and potentially improve performance. If this happens primarily in SQLite, it may be out of our control.

@mvorisek mvorisek force-pushed the improve_bench branch 2 times, most recently from 035a222 to 9d77734 Compare January 26, 2024 17:49
@mvorisek mvorisek marked this pull request as ready for review January 26, 2024 17:58
@mvorisek
Contributor Author

Dear maintainers, can you please review this PR?

It improves the collection to capture metrics for each script run, allowing robust statistics to be applied.

Apart from more robust/stable results, the benchmarking should work as before.

In a follow-up PR, I plan to also measure the startup/1st (compile) run multiple times to monitor that too. For that, I would like to land this PR first.

@iluuu1994
Member

iluuu1994 commented Mar 20, 2024

I appreciate your effort on this. However, it didn't seem like this significantly improved stability in your last comment? I see that it improves Wordpress significantly more than Symfony. However, Wordpress was never a problem to begin with. See:

https://nielsdos.github.io/php-benchmark-visualisation/

Wordpress looks quite predictable.

[image: Wordpress benchmark timeline]

Symfony has very large outliers, being ~0.5% faster. (Edit: Although tbh it has looked much better lately)

[image: Symfony benchmark timeline with large outliers]

Please correct me if I'm wrong.

I wonder if it would help to run the benchmark without ASLR, and possibly without find_prefered_mmap_base(). This might help make memory layout more consistent, including libraries like sqlite. Unfortunately, I doubt this is going to be possible on GitHub Actions.
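For local experiments, ASLR can be disabled per-process without touching system config (a sketch; the php-cgi invocation is illustrative, and `-R` requires a reasonably recent util-linux):

```shell
# -R (--addr-no-randomize) turns off ASLR only for the launched
# process and its children, leaving the rest of the system untouched.
setarch "$(uname -m)" -R valgrind --tool=callgrind php-cgi -T 50 index.php
```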

@mvorisek
Contributor Author

mvorisek commented Mar 20, 2024

@iluuu1994 the improvement is much better - see https://3v4l.org/Ugnd8#v8.3.4

the numbers there were collected using GH Actions for runs 10-500 (the first 10 runs were excluded so as not to skew the distributions; this is what the original implementation did with the "warmup iterations")

The absolute avg vs. median error is more than 50x better for JITted variants.

Can we agree the median is a much better metric for benchmark comparisons/charts?

If yes, can this PR be reviewed so I know exactly what to do in order to get it merged?

Thank you.

@iluuu1994
Member

Turns out disabling ASLR is indeed possible. #13769 Let's see if that resolves the issue fully, or whether this PR is still necessary.

@mvorisek
Contributor Author

mvorisek commented Mar 27, 2024

Disabling ASLR seems to have no improvement on stddev:

[image: stddev comparison with ASLR disabled]

As the data presented in #13162 (comment) show, this PR has a significant positive impact on the benchmark stddev. And with individual measurements, statistics can be plotted/computed easily.

@iluuu1994 iluuu1994 self-assigned this Mar 27, 2024
@iluuu1994
Member

iluuu1994 commented Mar 27, 2024

@mvorisek Fair enough, I will have a look in a couple of weeks. Feel free to ping me in 2-3 weeks if you don't get a response.

@kocsismate
Member

kocsismate commented Mar 28, 2024

Disabling ASLR seems to have no improvement on stddev:

I concur; I disabled ASLR in my benchmark as well, but I didn't get a noticeable improvement in std dev either.

(the two benchmarks didn't test the same PHP commits though)

5 participants